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Abstract 

Inference and learning of graphical models 
are both well-studied problems in statistics 
and machine learning that have found many 
applications in science and engineering. How- 
ever, exact inference is intractable in gen- 
eral graphical models, which suggests the 
problem of seeking the best approximation 
to a collection of random variables within 
some tractable family of graphical models. 
In this paper, we focus our attention on the 
class of planar Ising models, for which in- 
ference is tractable using techniques of sta- 
tistical physics [Kac and Ward; Kasteleyn]. 
Based on these techniques and recent meth- 
ods for planarity testing and planar embed- 
ding [Chrobak and Payne] , we propose a sim- 
ple greedy algorithm for learning the best 
planar Ising model to approximate an arbi- 
trary collection of binary random variables 
(possibly from sample data). Given the set 
of all pairwise correlations among variables, 
we select a planar graph and optimal pla- 
nar Ising model defined on this graph to best 
approximate that set of correlations. We 
demonstrate our method in some simulations 
and for the application of modeling senate 
voting records. 



1 Introduction 

Graphical models |Lau96[ IMac03j are widely used to 
represent the statistical relations among a set of ran- 
dom variables. Nodes of the graph correspond to ran- 



dom variables and edges of the graph represent statis- 
tical interactions among the variables. The problems 
of inference and learning on graphical models are en- 
countered in many practical applications. The prob- 
lem of inference is to deduce certain statistical proper- 
ties (such as marginal probabilities, modes etc.) of a 
given set of random variables whose graphical model is 
known. It has wide applications in areas such as error 
correcting codes, statistical physics and so on. The 
problem of learning on the other hand is to deduce 
the graphical model of a set of random variables given 
statistics (possibly from samples) of the random vari- 
ables. Learning is also a widely encountered problem 
in areas such as biology, anthropology and so on. 

A certain class of binary-variable graphical models 
with pairwise interactions known as the Ising model 
has been studied by physicists as a simple model 
of order-disorder transitions in magnetic materials 
|Ons44j . Remarkably, it was found that in the special 
case of an Ising model with zero- mean { — 1, +1} binary 
random variables and pairwise interactions defined on 
a planar graph, calculation of the partition function 
(which is closely tied to inference) is tractable, essen- 
tially reducing to calculation of a matrix determinant 
(IKW-jLH ISheoOl IKas631 lFis66j L These methods have 
recently found uses in machine learning |SK08[ IGJ07] . 

In this paper, we address the problem of approximat- 
ing a collection of binary random variables (given their 
pairwise marginal distributions) by a zero-mean planar 
Ising model. We also consider the related problem of 
selecting a non-zero mean Ising model defined on an 
outer-planar graph (these models are also tractable, 
being essentially equivalent to a zero-field model on a 
related planar graph). 

There has been a great deal of work on learning graph- 
ical models. Much of these has focused on learning 
over the class of thin graphical models [BJ011 IKS011 
SCG09] for which inference is tractable by convert- 
ing the model to tree-structured model. The sim- 
plest case of this is learning tree models (treewidth 
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one graphs) for which it is tractable to find the best 
tree model by reduction to a max-weight spanning 
tree problem }CL68j . However, the problem of find- 
ing the best bounded-treewidth model is NP-hard for 
treewidths greater than two [KSOl], and so heuris- 
tic methods are used to select the graph structure. 
One popular method is to use convex optimization 
of the log-likelihood penalized by l\ norm of param- 
eters of the graphical model so as to promote spar- 
sity }BEd08[ ILGK06j . To go beyond low treewidth 
graphs, such methods either focus on Gaussian graph- 
ical models or adopt a tractable approximation of the 
likelihood. Other methods seek only to learn the graph 
structure itself |RWL10|HXN06] and are often able to 
demonstrate asymptotic correctness of this estimate 
under appropriate conditions. One useful application 
of learning Ising models is for modeling interactions 
among neurons [CLM07 . 

The rest of the paper is organized as follows: We 
present the requisite mathematical preliminaries in 
Section 2. Section 3 contains our algorithm along with 
estimates of its computational complexity. We present 
simulation results in Section 4 and an application to 
the senate voting record in Section 5. We conclude in 
Section 5 and suggest promising directions for further 
research and development. All the proofs of proposi- 
tions are delegated to an appendix. 

2 Preliminaries 

In this section, we develop our notation and briefly 
review the necessary background theory. We will 
be dealing with binary random variables throughout 
the paper. We write P(x) to denote the probabil- 
ity distribution of a collection of random variables 
x = (xi, . . . ,x n ). Unless otherwise stated, we work 
with undirected graphs G = (V, E) with vertex (or 
node) set V and edges € E C (Y). For ver- 

tices i,j € V we write G + ij to denote the graph 
(V,EU{i,j}). 

A (pairwise) graphical model is a probability distribu- 
tion P{x) = P(xi, . . . ,x n ) that is defined on a graph 
G = (V, E) with vertices V = {1, .., n} as 

P(x) CX PJt/'t^t) II 1pij{Xi,X.j) 

oc exp{y^ fi(xj) + fij( X{ , Xj 

)} (1) 

i€V {i,j}eE 

where ipi, ipij > are non- negative node and edge com- 
patibility functions. For positive ?/>'s, we may also rep- 
resent P(x) as a Gibbs distribution with potentials 
fi = log ipi and fij = log ipij. 



2.1 Entropy, Divergence and Likelihood 

For any probability distribution P on some sample 
space x, its entropy is defined as [CT06 

H(P) = -^P(z)logP(z) 

Suppose we want to calculate how well a probability 
distribution Q approximates another probability dis- 
tribution P (on the same sample space x). For any 
two probability distributions P and Q on some sample 
space x, we denote by D(P,Q) the Kullback-Leibler 
divergence (or relative entropy) between P and Q. 

^,Q) = £P(*)log^ 

The log-likelihood function is defined as follows: 

LL(P,Q) =Y,P(x) log Q(x) 

xex 

The probability distribution in a family J- that max- 
imizes the log-likelihood of a probability distribution 
P is called the maximum-likelihood estimate of P in 
J- , and this is equivalent to the minimum-divergence 
projection of P to T: 

Pjr = arg max LL(P, Q) = argminD(P, Q) 

2.2 Exponential Families 

A set of random variables x — (xi, . . . , x n ) are said to 
belong to an exponential family [BN79, WJ08 if there 
exist functions 0i , • • • , 4> m (the features of the family) 
and scalars {parameters) 9%, ■ ■ ■ , 9 m such that the joint 
probability distribution on the variables is given by 



P(x) = 




where Z(0) is a normalizing constant called the parti- 
tion function. This corresponds to a graphical model 
if <j> a happen to be functions on small subsets (e.g., 
pairs) of all the n variables. The graph corresponding 
to such a probability distribution is the graph where 
two nodes have an edge between them if and only if 
there exists a such that (f> a depends on both variables. 
If the functions {4> a } are non-degenerate (please refer 
to [WJ08] for details), then for any achievable moment 
parameters [i — Ep[</>] (for an arbitrary distribution 
P) there exists a unique parameter vector 0(p) that 
realizes these moments within the exponential family. 

Let $(0) denote the log-partition function 
${6)±logZ(6) 
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For an exponential family, we have the following im- 
portant relation (of conjugate duality) between the log- 
partition function and negative entropy of the corre- 
sponding probability distribution H(fi) = H(Pg^) as 
follows |WJ08j 

$*(//) = max {^ T 9 - $(0)} = -H{fi) (2) 

if the mean parameters /i are achievable under some 
probability distribution. In fact, this corresponds 
to the problem of maximizing the log-likelihood rel- 
ative to an arbitrary distribution P with moments 
fi = Kp[4>] over the exponential family. The optimal 
choice of realizes the given moments (Eg[</>] = /i) 
and this solution is unique for non-degenerate choice 
of features. 

2.3 Ising Graphical Model 

The Ising model is a famous model in statistical 
physics that has been used as simple model of mag- 
netic phenomena and of phase transitions in complex 
systems. 

Definition 1. An Ising model on binary random vari- 
ables x = (xx, . . . ,x n ) and graph G = (V,E) is the 
probability distribution defined by 

p ( x ) = TFT^y exp 9 i x i + e H x i x 3 
1 ' \*ev {i,j}eE J 

where Xi G { — 1,1}. Thus, the model is specified by 
vertex parameters 9i and edge parameter 0y . 

This defines an exponential family with non- 
degenerate features {4>i(x) = Xi,i € V) and (<fiij(x) = 
XiXj,{i,j} G E) and with corresponding moments 
(/Zi = E[xi],i G V) and Qj+j = E[xiXj],{i,j} G E). 

In fact, any graphical model with binary variables and 
soft pairwise potentials can be represented as an Ising 
model with binary variables Xi = {— 1, +1} and with 
parameters 

@i 2 ^ \ •Ejfi ( x i ) ~t~ 4 ^ ^ ^ ^ x ifij( x i-) x j) 
Xi {i j'}G-E xt,Xj 

6ij j ^ XiXjfij(xi 7 Xj). 

There is also a simple correspondence between the mo- 
ment parameters of the Ising model and the node and 
edge-wise marginal distributions. Of course, it is triv- 
ial to compute the moments given these marginals: 

Mi = Ylxi x iP( x i) an d A% = J2 Xi ,x 3 XiXjP(Xi, Xj). 

The marginals are recovered from the moments by: 
P(xi) = \{l + fiiXi) 



P^Xij Xj) ^ (1 ~(~ f-liX{ -\- fJjjXj -\- flijXiXj) 

We will be especially concerned with the following sub- 
family of Ising models: 

Definition 2. An Ising model is said to be zero-field 
if 9i = for all i G V . It is zero-mean if [ii = 
(P(x 4 = ±1) = \) for all i G V. 

It is simple to verify that the Ising model is zero-field 
if and only if it is zero-mean. Although the assump- 
tion of zero-field appears very restrictive, a general 
Ising model can be represented as a zero-field model by 
adding one auxiliary variable node connected to every 
other node of the graph [G J07 . The parameters and 
moments of the two models are then related as follows: 

Proposition 1. Consider the Ising model on G = 
(V, E) with V = {1, . . . , n}, parameters {0^} and {0^}, 
moments {/ii} and } and partition function Z . Let 
G = (V , E) denote the extended graph based on nodes 
V = VU{n+l} with edges E = EU{{i,n+l},i G V}). 
We define a zero-field Ising model on G with param- 
eters moments {fiij} and partition function Z. 
If we set the parameters according to 

p.. _ f 0i if 3 =n + l 
' J I ®ij otherwise 

then Z = 2Z and 

a - = { Mi */i = «+l 
\ Mij otherwise 

Thus, inference on the corresponding zero-field Ising 
model on the extended graph G is essentially equiv- 
alent to inference on the (non-zero-field) Ising model 
defined on G. 

2.4 Inference for Planar Ising Models 

The motivation for our paper is the following result on 
tractability of inference for the planar zero-field Ising 
model. 

Definition 3. A graph is planar if it may be embedded 
in the plane without any edge crossings. 

Moreover, it is known that any planar graph can be 
embedded such that all edges are drawn as straight 
lines. 

Theorem 1. \KW5ty\She60\j Let Z denote the par- 
tition function of a zero-field Ising model defined on 
a planar graph G = (V,E). Let G be embedded in 
the plane (with edges drawn as straight lines) and let 
(frijk G [— 7r, +7r] denote the angular difference (turn- 
ing angle) between directed edges (i,j) and (j,k). We 
define the matrix W G C 2 ' £; ' x2 '" B ' indexed by directed 



Learning Planar Ising Models 



edges of the graph as follows: W = AD where D is the 
diagonal matrix with P>ij,ij = tanh 0y = and 

A ={ ^{kV-A-'j'iji), j = kandi^l 
ij,kl ~ | 0, " otherwise 

Then, the partition function of the zero-field planar 
Ising model is given by: 

Z = 2" Yl cosh % det ( J - W ) * 

We briefly remark the combinatorial interpretation of 
this theorem: W is the generating matrix of non- 
reversing walks of the graph with the weight of a walk 
7 being 

w 1 = Yi w v II exp(i\/^T0j j/c ). 

0j)£7 (*,J)A)67 

The determinant can be interpreted as the (inverse) 
graph zeta function: det(I — W) = Jl 7 (l — w i) where 
the product is taken over all equivalence classes of ape- 
riodic closed non-reversing walks |She601 ILoelOj . A 
related method for computing the Ising model parti- 
tion function is based on counting perfect matching of 
planar graphs [Kas63, Fis66j. We favor the Kac-Ward 
approach only because it is somewhat more direct. 

Since calculation of the partition function reduces to 
calculating the determinant of a matrix, one may 
use standard Gaussian elimination methods to eval- 
uate this determinant with complexity 0(n 3 ). In 
fact, using the generalized nested dissection algo- 
rithm to exploit sparsity of the matrix, the complex- 
ity of these calculations can be reduced to 0(n 3 / 2 ) 
[LRT791 ILT791 IGLVOOj . Thus, inference of the zero- 
field planar Ising model is tractable and scales well 
with problem size. 

It also turns out that the gradient and Hessian of the 
log-partition function $(0) = log Z(9) can be calcu- 
lated efficiently from the Kac-Ward determinant for- 
mula. We recall that derivatives of $(#) recover the 
moment parameters of the exponential family model 

[ENS3EEEB]: 

V$(0) = E e [^] = M . 

Thus, inference of moments (and node and edge 
marginals) are likewise tractable for the zero-field pla- 
nar Ising model. 

Proposition 2. Let fi = V$(0), H = V 2 $(6>). Let 
S = (I - W)^A and T = (I + P)(S o S T )(I + P T ) 
where A and W are defined as in Theorem 1, o denotes 
the element-wise product and P is the permutation ma- 
trix swapping indices of directed edges and (J,i). 



Then, 

Mij = w ij ~ |(1 ~ w ij)(Sij,ij + 

_ ( 1 - if if = kl, else 

Note, calculating the full matrix S requires 0(n 3 ) cal- 
culations. However, to compute just the moments fi 
only the diagonal elements of S are needed. Then, 
using the generalized nested dissection method, infer- 
ence of moments (edge- wise marginals) of the zero-field 
Ising model can be achieved with complexity 0(n 3 / 2 ). 
However, computing the full Hessian is more expen- 
sive, requiring 0(n 3 ) calculations. 

Inference for Outer-Planar Graphical Models 

We emphasize that the above calculations require both 
a planar graph G and a zero-field Ising model. Us- 
ing the graphical transformation of Proposition 1, the 
latter zero-field condition may be relaxed but at the 
expense of adding an auxiliary node connected to all 
the other nodes. In general planar graphs G, the new 
graph G may not be planar and hence may not admit 
tractable inference calculations. However, for the sub- 
set of planar graphs where this transformation does 
preserve planarity inference is still tractable. 

Definition 4. A graph G is said to be outer-planar if 
there exists an embedding of G in the plane where all 
the nodes are on the outer face. 

In other words, the graph G is outer-planar if the ex- 
tended graph G (defined by Proposition 1) is planar. 
Then, from Proposition 1 and Theorem 1 it follows 
that: 

Proposition 3. 'GJOll The partition function and 
moments of any outer-planar Ising graphical model 
(not necessarily zero-field) can be calculated efficiently. 
Hence, inference is tractable for any binary-variable 
graphical model with pairwise interactions defined on 
an outer-planar graph. 

This motivates the problem of learning outer-planar 
graphical models for a collection of (possibly non-zero 
mean) binary random variables. 

3 Learning Planar Ising Models 

This section addresses the main goals of the paper, 
which are two-fold: 

1. Solving for the maximum-likelihood Ising model 
on a given planar graph to best approximate a 
collection of zero-mean random variables. 
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2. How to select (heuristically) the planar graph to 
obtain the best approximation. 

We address these respective problems in the follow- 
ing two subsections. The solution of the first prob- 
lem is an integral part of our approach to the sec- 
ond. Both solutions are easily adapted to the context 
of learning outer-planar graphical models of (possibly 
non-zero mean) binary random variables. 

3.1 ML Parameter Estimation 

As discussed in Section 2.2, maximum- likelihood es- 
timation over an exponential family is a convex opti- 
mization problem ^ based on the log-partition func- 
tion <&(#). In the case of the zero-field Ising model de- 
fined on a given planar graph it is tractable to compute 
via a matrix determinant described in Theorem 
1. Thus, we obtain an unconstrained, tractable, con- 
vex optimization problem for the maximum-likelihood 
zero-field Ising model on the planar graph G to best 
approximate a probability distribution P(x): 

max — log cosh %) — \ logdet(J — W(9))} 

ij 

Here, /i^- = Ep^jXj] for all edges {i,j} € G and 
the matrix W(9) is as defined in Theorem 1. If P 
represents the empirical distribution of a set of inde- 
pendent identically-distributed (iid) samples {x^ s \ s = 
1, . . . , S} then {fiij} are the corresponding empirical 
moments /Lty = ^ ^ s x^Xj. 

Newton's Method We solve this unconstrained 
convex optimization problem using Newton's method 
with step-size chosen by back-tracking line search 
BV04]. This produces a sequence of estimates 9^ 
calculated as follows: 

6 (s+i) _ 0« + \ s H(9^ s) )- 1 {p{9^) - //) 

where fi{9^) and H(9^) are calculated using Propo- 
sition 2 and A s € (0, 1] is a step-size parameter chosen 
by backtracking line search (see |BV04j for details). 
The per iteration complexity of this optimization is 
0(n 3 ) using explicit computation of the Hessian at 
each iteration. This complexity can be offset some- 
what by only re-computing the Hessian a few times 
(reusing the same Hessian for a number of iterations) , 
to take advantage of the fact that the gradient compu- 
tation only requires 0{n^) calculations. As Newton's 
method has quadratic convergence, the number of it- 
erations required to achieve a high-accuracy solution 
is typically 8-16 iterations (essentially independent of 
problem size). We estimate the computational com- 
plexity of solving this convex optimization problem as 
roughly 0(n 3 ). 



3.2 Greedy Planar Graph Selection 

We now consider the problem of selection of the planar 
graph G to best approximate a probability distribu- 
tion P{x) with pairwise moments /i^ = Ep[xjXj] given 
for all i,j G V. Formally, we seek the planar graph 
that maximizes the log-likelihood (minimizes the di- 
vergence) relative to P: 

G = arg max LL(P, Pq) = argmax max LL(P, Q) 
Gev v Gev v Q eJr o 

where Vy is the set of planar graphs on the vertex set 
V, Tg denotes the family of zero-field Ising models 
defined on graph G and Pq = ar g max Qe^ G LL(P, Q) 
is the maximum-likelihood (minimum-divergence) ap- 
proximation to P over this family. 

We obtain a heuristic solution to this graph selection 
problem using the following greedy edge-selection pro- 
cedure. The input to the algorithm is a probabil- 
ity distribution P (which could be empirical) on n 
binary {—1, 1} random variables. In fact, it is suf- 
ficient to summarize P by its pairwise correlations 
fiij = Mp[xiXj] on all pairs i,j 6 V. The output is a 
maximal planar graph G and the maximum-likelihood 
approximation 9q to P in the family of zero-field Ising 
models defined on this graph. 



Algorithm 1 GreedyPlanarGraphSelect(P) 

1: G = 0,# G = 

2: for k = 1 : 3n — 6 do 

3: A = {{i,j} c V\{i,j} £G,G + ij£ Vv} 
4: jEtA = {fiij = Ee G [xiXj],{i,j} € A} 
5: G <- G U arg max D(P e , P e ) 

6: 9 G = PlanarIsing(G, P) 

7: end for 



The algorithm starts with an empty graph and then 
sequentially adds edges to the graph one at a time so as 
to (heuristically) increase the log-likelihood (decrease 
the divergence) relative to P as much as possible at 
each step. Here is a more detailed description of the 
algorithm along with estimates of the computational 
complexity of each step: 

• Line 3. First, we enumerate the set A of all edges 
one might add (individually) to the graph while 
preserving planarity. This is accomplished by an 
0(n 3 ) algorithm in which we iterate over all pairs 
{i,j} & G and for each such pair we form the 
graph G + ij and test planarity of this graph using 
known 0(n) algorithms CP95J. 

• Line 4- Next, we perform tractable inference cal- 
culations with respect to the Ising model on G to 
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calculate the pairwise correlations jlij for all pairs 
{i, j} G A. This is accomplished using (3(n 3 / 2 ) in- 
ference calculations on augmented versions of the 
graph G. For each inference calculation we add 
as many edges to G from A as possible (setting 
9 = on these edges) while preserving planarity 
and then calculate all the edge-wise moments of 
this graph using Proposition 2 (including the zero- 
edges). This requires at most O(n) iterations to 
cover all pairs of A, so the worst-case complex- 
ity to compute all required pairwise moments is 
0(n 5 / 2 ). 

• Line 5. Once we have these moments, which 
specify the corresponding pairwise marginals of 
the current Ising model, we compare these mo- 
ments (pairwise marginals) to those of the input 
distribution P by evaluating the pairwise KL- 
divergence between the Ising model and P. As 
seen by the following proposition, this gives us 
a lower-bound on the improvement obtained by 
adding the edge: 

Proposition 4. Let Pq and Pa+ij be the projec- 
tions of P on G and G + ij respectively. Then, 

D(P,P G )-D(P,P G+ij ) > D(P(xi,a? J ),Po(s i ,s i )) 

where P(xi,Xj) and Pc{xi,Xj) represent the 
marginal distributions on Xi,Xj of probabilities P 
and Pq respectively. 

Thus, we greedily select the next edge {i,j} to 
add so as to maximize this lower-bound on the 
improvement measured by the increase on log- 
likelihood (this being equal to the decrease in KL- 
divergence) . 

• Line 6. Finally, we calculate the new maximum- 
likelihood parameters Qq on the new graph G <— 
G + ij. This involves solution of the convex opti- 
mization problem discussed in the preceding sub- 
section, which requires 0(n 3 ) complexity. This 
step is necessary in order to subsequently calcu- 
late the pairwise moments fx which guide further 
edge-selection steps, and also to provide the final 
estimate. 

We continue adding one edge at a time until a maximal 
planar graph (with in — 6 edges) is obtained. Thus, 
the total complexity of our greedy algorithm for planar 
graph selection is 0{n ). 

Non-Maximal Planar Graphs Since adding an 
edge always gives an improvement in the log- 
likelihood, the greedy algorithm always outputs a max- 
imal planar graph. However, this might lead to over- 



fitting of the data especially when the input proba- 
bility distribution corresponds to an empirical distri- 
bution. In such cases, to avoid over-fitting, we might 
modify the algorithm so that an edge is added to the 
graph only if the improvement in log- likelihood is more 
than some threshold 7. An experimental search can be 
performed for a suitable value of this threshold (e.g. 
so as to minimize some estimate of the generalization, 
such as in cross validation methods Zha93 ). 

Outer-Planar Graphs and Non-Zero Means 

The greedy algorithm returns a zero-field Ising model 
(which has zero mean for all the random variables) de- 
fined on a planar graph. If the actual random variables 
are non-zero mean, this may not be desirable. For 
this case we may prefer to exactly model the means of 
each random variable but still retain tractability by re- 
stricting the greedy learning algorithm to select outer- 
planar graphs. This model faithfully represents the 
marginals of each random variable but at the cost of 
modeling fewer pairwise interactions among the vari- 
ables. 

This is equivalent to the following procedure. First, 
given the sample moments {/^;} and {/%•} we convert 
these to an equivalent set of zero-mean moments fi on 
the extended vertex set V = V U {n + 1} according 
to Proposition 1. Then, we select a zero- mean pla- 
nar Ising model for these moments using our greedy 
algorithm. However, to fit the means of each of the 
original n variables, we initialize this graph to include 
all the edges {i, n + 1} for all i E V. After this ini- 
tialization step, we use the same greedy edge-selection 
procedure as before. This yields the graph G and pa- 
rameters 9q. Lastly, we convert back to a (non-zero 
field) Ising model on the subgraph of G defined on 
nodes V, as prescribed by Proposition 1. The resulting 
graph G and parameters Qq is our heuristic solution 
for the maximum-likelihood outer-planar Ising model. 

Lastly, we remark that it is not essential that one 
chooses between the zero-field planar Ising model and 
the outer-planar Ising model. We may allow the 
greedy algorithm to select something in between — a 
partial outer-planar Ising model where only nodes of 
the outer-face are allowed to have non-zero means. 
This is accomplished simply by omitting the initial- 
ization step of adding edges {i, n + 1} for alH e V. 

4 Simulations 

In this section, we present the results of numerical ex- 
periments evaluating our algorithm. 

Counter Example The first result, presented in 
Figure 1 illustrates the fact that our algorithm does 
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not always recover the exact structure even when the 
underlying graph is planar and the algorithm is given 
exact moments as inputs. 




Figure 1: Counter example : (a) Original graphical 
model (b) Recovered graphical model. The recovered 
graphical model has one spurious edge {a, e} and one 
missing edge {c,d}. It is clear from this example that 
our algorithm is not always optimal. 



We define a zero-field Ising model on the graph in Fig- 
ure l(a)| with the edge parameters as follows: 6^ = 
c d = ®bd = 0.1 and 9^ = 1 for all the other edges. 
Figure 1(a) shows the edge parameters in the graph 
pictorially using the intensity of the edges - higher the 
intensity of an edge, higher the corresponding edge 
parameter. When the edge parameters are as chosen 
above, the correlation between nodes a and e is greater 
than the correlation between any other pair of nodes. 
This leads to the edge between a and e to be the first 
edge added in the algorithm. However, since K5 (the 
complete graph on 5 nodes) is not planar, one of the 
actual edges is missed in the output graph. Figure 



1(b) shows the edge weighted recovered graph. 



Recovery of Zero-Field Planar Ising Model We 

now present the results of our experiments on a zero 
field Ising model on a 7 x 7 grid. The edge parameters 
are chosen to be uniformly random between —1 and 1 
with the condition that the absolute value be greater 
than a threshold (chosen to be 0.05) so as to avoid 
edges with negligible interactions. We use Gibbs sam- 
pling to obtain samples from this model and calculate 
empirical moments from these samples which are then 
passed as input to our algorithm. The results are seen 
in Figure 2 (see caption for details). 

Recovery of Non-Zero-Field Outer Planar Ising 
Model As explained in Section |3.2| our algorithm 
can also be used to find the best outer planar graphical 
model describing a given empirical probability distri- 
bution. In this section, we present the results of our 
numerical experiments on a 12 node outer planar bi- 
nary pairwise graphical model where the nodes have 
non-zero mean. Though our algorithm gives perfect 
reconstruction on graphs with many more nodes, we 
choose a small example to illustrate the result effec- 
tively. We again use Gibbs sampling to obtain samples 
and calculate empirical moments from those samples. 




Figure 2: 7x7 grid : (a) Original graphical model 
(b) Recovered graphical model (10 4 samples) (c) Re- 
covered graphical model (10 5 samples). The inputs 
to the algorithm are the empirical moments obtained 
from the samples. The algorithm is stopped when the 
recovered graph has the same number of edges as the 
original graphical model. With 10 4 samples, there are 
some errors in the recovered graphical model. When 
the number of samples is increased to 10 5 , we see per- 
fect recovery. 



Figure 3(a) presents the original graphical model. Fig- 
ures |3(b) and 3(c) present the output graphical models 
for 10 3 and 10 4 samples respectively. We make sure 
that the first moments of all the nodes arc satisfied by 
starting with the auxiliary node connected to all other 
nodes. When the number of samples is 10 3 , the num- 
ber of erroneous edges in the output as depicted by 
Figure [3(b)| is 0.18. However, as the number of sam- 
ples increases to 10 4 , the recovered graphical model in 
Figure |3(c)| is exactly the same as the original graphi- 
cal model. 




Figure 3: Outer planar graphical model : (a) Original 
graphical model (b) Recovered graphical model (10 3 
samples) (c) Recovered graphical model (10 4 samples). 
Even in this case, the number of erroneous edges in 
the output of our algorithm decreases with increasing 
number of samples. With 10 4 samples, we recover the 
graphical model exactly. 



5 An Example Application: Modeling 
Correlations of Senator Voting 

In this section, we use our algorithm in an interesting 
application to model correlations of senator voting fol- 
lowing Banerjee et al. |BEd08j . We use the senator 
voting data for the years 2009 and 2010 to calculate 
the correlations between the voting patterns of dif- 
ferent senators. A Yea vote is treated as +1 and a 
Nay vote is treated as —1. We also consider non- votes 
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Nelson (D-FL) 



Figure 4: Graphical model representing the senator voting pattern : The blue nodes represent democrats, the red 
nodes represent republicans and the black node represents an independent. The above graphical model conveys 
many facts that are already known to us. For instance, the graph shows Sanders with edges only to democrats 
which makes sense because he caucuses with democrats. Same is the case with Lieberman. The graph also shows 
the senate minority leader McConnell well connected to other republicans though the same is not true of the 
senate majority leader Reid. We use the graph drawing algorithm of Kamada and Kawai [KK89 . 



as —1, but only consider those senators who voted in 
atleast | of the votes under consideration to limit bias. 
We run our algorithm on the correlation data to ob- 
tain the maximal planar graph modeling the senator 
voting pattern which is presented in Figure [5j 

6 Conclusion 

We have proposed a greedy heuristic to obtain the 
maximum-likelihood planar Ising model approxima- 
tion to a collection of binary random variables with 
known pairwise marginals. The algorithm is simple 
to implement with the help of known methods for 
tractable inference in planar Ising models, efficient 
methods for planarity testing and embedding of pla- 
nar graphs. We have presented simulation results of 
our algorithm on sample data and on the senate voting 
record. 

Future Work Many directions for further work are 
suggested by the methods and results of this paper. 
Firstly, we know that the greedy algorithm is not guar- 
anteed to find the best planar graph. Hence, there are 
several strategies one might consider to further refine 
the estimate. One is to allow the greedy algorithm to 
also remove previously-added edges which prove to be 



less important than some other edge. 

It may also be possible to use some more generalized 
notion of local search, such as adding/removing multi- 
ple edges at a time such as searching the space of maxi- 
mal planar graphs by considering "edge flips" , that is, 
replacing an edge by an orthogonal edge connecting 
opposite vertices of the two adjacent faces. One could 
also consider randomized search strategies such as sim- 
ulated annealing or genetic programming in the hope 
of escaping local minima. Another limitation of our 
current framework is that it only allows learning pla- 
nar graphical models on the set of observed random 
variables and, moreover, requires that all variables are 
observed in each sample. One could imagine exten- 
sions of our approach to handle missing samples (us- 
ing the expectation-maximization approach) or to try 
to identify hidden variables that were not seen in the 
data. 
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Supplementary Material (Proofs) 

Proposition 1. Let the probability distributions cor- 
responding to G and G be P and P respectively and 
the corresponding expectations be E and E respec- 
tively. For the partition function, we have that 



Z = ^ CX P ( E 



E exp x n+ i ^2 °i x i + E 



iev 



exp QiXi + OijXiXj 

{i,j}£E 



E ex p -E< 



iev 



E < 

{i,j'}6-B 



7 XiXj 



2 E ex p E< 



E < 



; ;X ; .7. ; 



2Z 



where the fourth equality follows from the symmetry 
between —1 and 1 in an Ising model. 

For the second part, since P is zero-field, we have that 

E[xi] =0VieV 

Now consider any {i,j} G E. If is fixed to a value 
of 1, then the model is the same as original on V and 
we have 

¥\xiXj | x n +\ = 1] = E^i^j] V {i, j} € E 

By symmetry (between —1 and 1) in the model, the 
same is true for x n +i = — 1 and so we have 

E[xiXj] 

= E[xiXj | x n+ i = l]P(x n+ i = 1) 
+E[xiXj | x n+ i = -l]P(x n+ i = -1) 
= E[xjXj] 

Fixing x n+ i to a value of 1, we have 

E[ Xi \x n+1 = l]=E[x l ]VieV 

and by symmetry 

E[ Xl | x n+1 = -1] = -E[ Xi ] VieV 
Combining the two equations above, we have 

E[xiX n+1 ] 

= EJZj | X n+1 = l]P{Xn+l = 1) 

+E[-Xi | x n+ i = -l]P(x n+ i = -1) 
= E[xi] 

□ 



Proposition 2. From Theorem 1, we see that the log 
partition function can be written as 

$(0) = n log 2+ ^2 lo S cosh % + o log det ( 7 ~~ AD ^ 

{i,j}£E 

where A and D are as given in Theorem 1. For the 
derivatives, we have 



d<S>(6) 
d9ij 



tanh% + iTr ((/ - AD) 



-1 d(I-AD) 



tanh% - ^Tr ((J - AD)~ 1 AD' i ^) 



where _D,' is the derivative of the matrix £> with re- 
spect to % . The first equality follows from chain rule 
and the fact that VK = K^ 1 for any matrix K. Please 
refer [BV04] for details. 

For the Hessian, we have 

d 2 jKg) _ i d 2 z(8) i f dz{e) \ 2 
oe 2 . ~ z{6) ae 2 . z(sy \ ae^ ) 

For {i,j} 7^ {fc, /}, following |B V04j . we have 



=-lTr (SD' ijS D' k 



kl) 



g(l — u^) {Sij t kiSki,n + Sji.kiS, 



(eij + e ji ) T {S o S T )(e k i + e lk ) 
(S o S T ) lhU + (S o S T )ijj k 



+ SijJ k Slk,ij + Sji.lkSlkJi) (1 — w fci ) 

On the other hand, we also have 

% w - eg (I + P) (5 o 5 T K7 + P)e fe; 

+(5 o S T ) jiM + (S o S T ) jit ik 

th 

where is the unit vector with 1 in the position 
and everywhere else. Using the above two equations, 
we obtain 



H ljM = --(1 - ^)Py,fe/(l - w^) 



□ 



Proposition 4. The proof follows from the following 
steps of inequalities. 

D(P, P G ) - D(P, P G+ij ) + D(P G+ij , P G ) 
= D(P,P G+lj )+ 

D(P G+i j(xi,Xj),P G (xi,Xj))+ 

D(P G+ ij{x V -ij),P G (x V -ij)) 
>D(P,P G+ij )+ 

D (P G+ ij (xi,Xj),P G (xi,Xj))+ 
>D{P,P G+lj ) + 

D(P(xi, Xj), P G (xi,Xj)) 
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where the first step follows from the Pythagorean law 
of information projection [AKN92J, the second step 
follows from the conditional rule of relative entropy 
[CT06| . the third step follows from the information 
inequality |CT06) and finally the fourth step follows 
from the property of information projection to G + ij 
[W.I 08 j . □ 



